Collecting and mapping tweets in R

Motivation

This post will walk you through 1) collecting data from the Twitter API using the rtweet package, 2) creating a map of the tweets using ggmap, maps, and mapdata, and 3) graphing the tweets. You can find excellent documentation on the package website; I am just going to go into more detail.

Set up the twitter app (with rtweet)

Install/load the package

This is the first step for collecting tweets based on location. See the vignette here. I’ve outlined this process in the link below.

rtweet_setup

Searching for word occurrences in tweets

We will start by collecting data on occurrences of a certain hashtag. As I write this, it is game three of the NBA Finals, so I will search for the hashtag #DubNation. The function for collecting tweets is rtweet::search_tweets(), and it takes a query q (our search term).

Learn more about this function by typing ?rtweet::search_tweets in the R console.

We will use all the default settings in this initial search.
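The search call itself is not shown in this rendering of the post, so the following is only a minimal sketch of what it might look like. It assumes rtweet is installed and a Twitter API token has already been configured; the wrapper name collect_hashtag() is hypothetical, and any extra arguments (such as n) are simply passed through.

```r
# Hypothetical helper (not part of rtweet): search recent tweets for one hashtag.
# Assumes rtweet is installed and an API token has already been configured.
collect_hashtag <- function(hashtag, ...) {
  # q is the search query; everything else stays at rtweet's defaults
  # unless it is overridden through ...
  rtweet::search_tweets(q = hashtag, ...)
}

# Example call (needs network access and valid credentials):
# DubTweetsGame3 <- collect_hashtag("#DubNation")
# dplyr::glimpse(DubTweetsGame3)
```

Wrapping the call in a function keeps the sketch loadable without credentials; in an interactive session the rtweet::search_tweets() call can be typed directly.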

After rtweet::search_tweets() has run, I will take a look at the resulting data frame with dplyr::glimpse().

Observations: 15,004
Variables: 87
$ user_id                 <chr> "1000036251272187904", "10002061971721216...
$ status_id               <chr> "1004574929558257665", "10045813096442019...
$ created_at              <dttm> 2018-06-07 04:05:01, 2018-06-07 04:30:23...
$ screen_name             <chr> "hsn_sports", "MahmoudZaytoon8", "Wholelo...
$ text                    <chr> "\U0001f3c0 Golden State Warriors, Kevin ...
$ source                  <chr> "Twitter for Android", "Twitter for Andro...
$ display_text_width      <dbl> 168, 140, 139, 143, 35, 72, 101, 45, 98, ...
$ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ is_retweet              <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRU...
$ favorite_count          <int> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,...
$ retweet_count           <int> 0, 1210, 1816, 1393, 0, 8963, 508, 1, 1, ...
$ hashtags                <list> [<"DubNation", "NBAFinals">, <"NBAFinals...
$ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_url                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_t.co               <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_expanded_url       <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ media_url               <list> ["http://pbs.twimg.com/tweet_video_thumb...
$ media_t.co              <list> ["https://t.co/08UGymu2xy", NA, NA, NA, ...
$ media_expanded_url      <list> ["https://twitter.com/hsn_sports/status/...
$ media_type              <list> ["photo", NA, NA, NA, "photo", "photo", ...
$ ext_media_url           <list> ["http://pbs.twimg.com/tweet_video_thumb...
$ ext_media_t.co          <list> ["https://t.co/08UGymu2xy", NA, NA, NA, ...
$ ext_media_expanded_url  <list> ["https://twitter.com/hsn_sports/status/...
$ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ mentions_user_id        <list> [NA, "19923144", <"19923144", "26270913"...
$ mentions_screen_name    <list> [NA, "NBA", <"NBA", "warriors">, <"NBA",...
$ lang                    <chr> "tr", "en", "en", "en", "und", "und", "en...
$ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ retweet_status_id       <chr> NA, "1004445319499845632", "1004567827133...
$ retweet_text            <chr> NA, "Get pumped for Game 3... with Steph ...
$ retweet_created_at      <dttm> NA, 2018-06-06 19:30:00, 2018-06-07 03:3...
$ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ retweet_favorite_count  <int> NA, 3967, 4078, 3501, NA, 13970, 1689, NA...
$ retweet_retweet_count   <int> NA, 1210, 1816, 1393, NA, 8963, 508, NA, ...
$ retweet_user_id         <chr> NA, "19923144", "19923144", "19923144", N...
$ retweet_screen_name     <chr> NA, "NBA", "NBA", "NBA", NA, "warriors", ...
$ retweet_name            <chr> NA, "NBA", "NBA", "NBA", NA, "Golden Stat...
$ retweet_followers_count <int> NA, 27819753, 27819758, 27819758, NA, 583...
$ retweet_friends_count   <int> NA, 1664, 1664, 1664, NA, 987, 1664, NA, ...
$ retweet_statuses_count  <int> NA, 201209, 201209, 201209, NA, 76309, 20...
$ retweet_location        <chr> NA, "", "", "", NA, "Oakland, CA", "", NA...
$ retweet_description     <chr> NA, "30 teams, 1 goal. #ThisIsWhyWePlay",...
$ retweet_verified        <lgl> NA, TRUE, TRUE, TRUE, NA, TRUE, TRUE, NA,...
$ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ geo_coords              <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
$ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
$ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, ...
$ name                    <chr> "HSN Sports", "Mahmoud Zaytoon", "Wholelo...
$ location                <chr> "", "الأسكندرية, مصر", "Fort Myers, FL", ...
$ description             <chr> "", "Ahly Barça Liverpool ❤", "Sc\U0001f4...
$ url                     <chr> NA, NA, NA, NA, "https://t.co/4NGzgl9FKn"...
$ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ followers_count         <int> 10, 27, 8, 8, 22, 1, 4, 313, 313, 20, 20,...
$ friends_count           <int> 104, 268, 48, 48, 90, 7, 144, 156, 156, 7...
$ listed_count            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ statuses_count          <int> 41, 18, 8, 8, 320, 4, 1, 335, 335, 166, 1...
$ favourites_count        <int> 0, 138, 0, 0, 328, 3, 5, 98, 98, 163, 163...
$ account_created_at      <dttm> 2018-05-25 15:29:56, 2018-05-26 02:45:15...
$ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ profile_url             <chr> NA, NA, NA, NA, "https://t.co/4NGzgl9FKn"...
$ profile_expanded_url    <chr> NA, NA, NA, NA, "http://www.instagram.com...
$ account_lang            <chr> "tr", "en", "en", "en", "en", "en", "en",...
$ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/10...
$ profile_background_url  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/1000...

This data set contains 15,004 observations. If I want more tweets, I need to adjust the cap on the number of tweets I can collect with my API. I can do this by setting the retryonratelimit argument to TRUE.

See below from the manual:

Logical indicating whether to wait and retry when rate limited. This argument is only relevant if the desired return (n) exceeds the remaining limit of available requests (assuming no other searches have been conducted in the past 15 minutes, this limit is 18,000 tweets). Defaults to false. Set to TRUE to automate process of conducting big searches (i.e., n > 18000). For many search queries, esp. specific or specialized searches, there won’t be more than 18,000 tweets to return. But for broad, generic, or popular topics, the total number of tweets within the REST window of time (7-10 days) can easily reach the millions.

Collect data for #DubNation and #StrengthInNumbers with rtweet::search_tweets2()

The rtweet::search_tweets2() function works just like rtweet::search_tweets(), but also “returns data from one OR MORE search queries.”

I’ll use rtweet::search_tweets2() to collect data for two hashtags now, #DubNation and #StrengthInNumbers, this time setting n to 50000 and the retryonratelimit argument to TRUE.
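A sketch of that two-hashtag search follows. The values n = 50000 and retryonratelimit = TRUE come from the text; the helper name collect_hashtags() is hypothetical, and the quoted form of the first query is an assumption based on how the values are printed in the query column further down.

```r
# Query vector: the quoted form of #DubNation mirrors the values printed
# in the query column later in the post (an assumption, not confirmed code).
hashtag_queries <- c("\"#DubNation\"", "#StrengthInNumbers")

# Hypothetical helper; assumes rtweet is installed and a token is configured.
collect_hashtags <- function(queries, n = 50000) {
  rtweet::search_tweets2(
    q = queries,
    n = n,
    retryonratelimit = TRUE  # wait and retry when the rate limit is reached
  )
}

# Example call (can take a long time because of rate-limit pauses):
# DubNtnStrngthNmbrs <- collect_hashtags(hashtag_queries)
```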

The structure of this data frame is displayed below with dplyr::glimpse().

Observations: 56,966
Variables: 88
$ user_id                 <chr> "1000031071675744256", "10000310716757442...
$ status_id               <chr> "1005524470843441153", "10055250664933744...
$ created_at              <dttm> 2018-06-09 18:58:10, 2018-06-09 19:00:32...
$ screen_name             <chr> "Gracinha_pxt", "Gracinha_pxt", "Gracinha...
$ text                    <chr> "RT @warriors: #DubNation, your 2018 Cham...
$ source                  <chr> "Twitter for iPhone", "Twitter for iPhone...
$ display_text_width      <dbl> 140, 103, 63, 140, 77, 140, 140, 139, 123...
$ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ is_retweet              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,...
$ favorite_count          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ retweet_count           <int> 1136, 92, 148, 1, 6, 1136, 1136, 1012, 14...
$ hashtags                <list> ["DubNation", <"DubNation", "NBAFinals",...
$ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_url                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_t.co               <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_expanded_url       <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ media_url               <list> [NA, "http://pbs.twimg.com/media/DfOVWIN...
$ media_t.co              <list> [NA, "https://t.co/kSbS11eZxy", "https:/...
$ media_expanded_url      <list> [NA, "https://twitter.com/GloballyCurry3...
$ media_type              <list> [NA, "photo", "photo", NA, "photo", NA, ...
$ ext_media_url           <list> [NA, "http://pbs.twimg.com/media/DfOVWIN...
$ ext_media_t.co          <list> [NA, "https://t.co/kSbS11eZxy", "https:/...
$ ext_media_expanded_url  <list> [NA, "https://twitter.com/GloballyCurry3...
$ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ mentions_user_id        <list> ["26270913", "735871787875991553", "7358...
$ mentions_screen_name    <list> ["warriors", "GloballyCurry30", "Globall...
$ lang                    <chr> "en", "en", "und", "pt", "en", "en", "en"...
$ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ retweet_status_id       <chr> "1005498277033226241", "10053120874077143...
$ retweet_text            <chr> "#DubNation, your 2018 Champs \U0001f3c6 ...
$ retweet_created_at      <dttm> 2018-06-09 17:14:05, 2018-06-09 04:54:14...
$ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ retweet_favorite_count  <int> 5080, 316, 397, 1, 15, 5080, 5080, 4685, ...
$ retweet_retweet_count   <int> 1136, 92, 148, 1, 6, 1136, 1136, 1012, 14...
$ retweet_user_id         <chr> "26270913", "735871787875991553", "735871...
$ retweet_screen_name     <chr> "warriors", "GloballyCurry30", "GloballyC...
$ retweet_name            <chr> "Golden State Warriors", "Team Wardell SC...
$ retweet_followers_count <int> 5863529, 44051, 44051, 70, 90, 5863529, 5...
$ retweet_friends_count   <int> 986, 4933, 4933, 84, 121, 986, 986, 26, 1...
$ retweet_statuses_count  <int> 76452, 6888, 6888, 189, 25, 76452, 76452,...
$ retweet_location        <chr> "Oakland, CA", "World Wide - Warriors", "...
$ retweet_description     <chr> "\U0001f3c6\U0001f3c6\U0001f3c6\U0001f3c6...
$ retweet_verified        <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, T...
$ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ geo_coords              <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
$ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
$ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, ...
$ name                    <chr> "Karen Peixoto", "Karen Peixoto", "Karen ...
$ location                <chr> "Santa Cruz, Rio de Janeiro", "Santa Cruz...
$ description             <chr> "Twitter novo, o outro foi bloqueado. Jes...
$ url                     <chr> NA, NA, NA, NA, NA, "https://t.co/fbyfQas...
$ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ followers_count         <int> 129, 129, 129, 50, 86, 63, 1, 10, 2, 11, ...
$ friends_count           <int> 127, 127, 127, 68, 91, 161, 20, 235, 102,...
$ listed_count            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
$ statuses_count          <int> 340, 340, 340, 128, 191, 1844, 29, 12, 10...
$ favourites_count        <int> 1556, 1556, 1556, 60, 55, 744, 9, 6, 13, ...
$ account_created_at      <dttm> 2018-05-25 15:09:22, 2018-05-25 15:09:22...
$ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ profile_url             <chr> NA, NA, NA, NA, NA, "https://t.co/fbyfQas...
$ profile_expanded_url    <chr> NA, NA, NA, NA, NA, "http://Instagram.com...
$ account_lang            <chr> "en", "en", "en", "pt", "pt", "en", "zh-T...
$ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/10...
$ profile_background_url  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "...
$ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/1005...
$ query                   <chr> "\"#DubNation\"", "\"#DubNation\"", "\"#D...

This data frame has 56,966 observations, and adds one additional variable. We can use the handy base::setdiff() to figure out what variables are in DubNtnStrngthNmbrs that aren’t in DubTweetsGame3.

[1] "query"
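The comparison can be reproduced on small stand-in data frames. The two objects below are toy versions with made-up columns, not the real search results; only the base::setdiff() call on the column names reflects the step described above.

```r
# Toy stand-ins for the two data frames (the real ones have 87 and 88 columns)
DubTweetsGame3_toy <- data.frame(
  user_id = "1", text = "first search",
  stringsAsFactors = FALSE
)
DubNtnStrngthNmbrs_toy <- data.frame(
  user_id = "2", text = "second search", query = "#DubNation",
  stringsAsFactors = FALSE
)

# Column names present in the second data frame but not in the first
new_vars <- base::setdiff(
  names(DubNtnStrngthNmbrs_toy),
  names(DubTweetsGame3_toy)
)
new_vars
```

Note that the argument order matters: swapping the two arguments would list columns that exist only in the first data frame.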

The query variable contains our two search terms.

# A tibble: 2 x 2
  query                  n
  <chr>              <int>
1 "\"#DubNation\""   34349
2 #StrengthInNumbers 22617
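A tally like the one above can be produced with dplyr::count(). This is a sketch on a three-row toy tibble (toy_tweets is a made-up object), assuming dplyr and tibble are installed.

```r
library(dplyr)

# Toy stand-in for DubNtnStrngthNmbrs with only the columns needed here
toy_tweets <- tibble::tibble(
  status_id = c("1", "2", "3"),
  query = c("\"#DubNation\"", "\"#DubNation\"", "#StrengthInNumbers")
)

# One row per distinct query value, with the number of tweets in column n
query_counts <- toy_tweets %>%
  dplyr::count(query)
query_counts
```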

Get user data with rtweet::users_data()

The previous data frame had 88 variables in it, which include variables on both users and tweets. We can use the rtweet::users_data() function to extract the users variables.

The base::intersect() function allows us to see what variables from DubNtnStrngthNmbrs will end up in the results from rtweet::users_data().

I added tibble::as_tibble() so the variables print nicely to the screen.

# A tibble: 20 x 1
   value                 
   <chr>                 
 1 user_id               
 2 screen_name           
 3 name                  
 4 location              
 5 description           
 6 url                   
 7 protected             
 8 followers_count       
 9 friends_count         
10 listed_count          
11 statuses_count        
12 favourites_count      
13 account_created_at    
14 verified              
15 profile_url           
16 profile_expanded_url  
17 account_lang          
18 profile_banner_url    
19 profile_background_url
20 profile_image_url     

I’ll store the contents in a new data frame called UsersDubNtnStrngthNmbrs.

Observations: 56,966
Variables: 20
$ user_id                <chr> "1000031071675744256", "100003107167574425...
$ screen_name            <chr> "Gracinha_pxt", "Gracinha_pxt", "Gracinha_...
$ name                   <chr> "Karen Peixoto", "Karen Peixoto", "Karen P...
$ location               <chr> "Santa Cruz, Rio de Janeiro", "Santa Cruz,...
$ description            <chr> "Twitter novo, o outro foi bloqueado. Jesu...
$ url                    <chr> NA, NA, NA, NA, NA, "https://t.co/fbyfQas8...
$ protected              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ followers_count        <int> 129, 129, 129, 50, 86, 63, 1, 10, 2, 11, 6...
$ friends_count          <int> 127, 127, 127, 68, 91, 161, 20, 235, 102, ...
$ listed_count           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
$ statuses_count         <int> 340, 340, 340, 128, 191, 1844, 29, 12, 102...
$ favourites_count       <int> 1556, 1556, 1556, 60, 55, 744, 9, 6, 13, 5...
$ account_created_at     <dttm> 2018-05-25 15:09:22, 2018-05-25 15:09:22,...
$ verified               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ profile_url            <chr> NA, NA, NA, NA, NA, "https://t.co/fbyfQas8...
$ profile_expanded_url   <chr> NA, NA, NA, NA, NA, "http://Instagram.com/...
$ account_lang           <chr> "en", "en", "en", "pt", "pt", "en", "zh-TW...
$ profile_banner_url     <chr> "https://pbs.twimg.com/profile_banners/100...
$ profile_background_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "h...
$ profile_image_url      <chr> "http://pbs.twimg.com/profile_images/10056...

Get tweet data with rtweet::tweets_data()

I can also create another data frame with the tweet information using the rtweet::tweets_data() function. Just like above, I will display the variables in this new data frame (but limit it to the top 20).

I will store these variables in the TweetsDubNtnStrngthNmbrs data frame.

# A tibble: 20 x 1
   value               
   <chr>               
 1 user_id             
 2 status_id           
 3 created_at          
 4 screen_name         
 5 text                
 6 source              
 7 display_text_width  
 8 reply_to_status_id  
 9 reply_to_user_id    
10 reply_to_screen_name
11 is_quote            
12 is_retweet          
13 favorite_count      
14 retweet_count       
15 hashtags            
16 symbols             
17 urls_url            
18 urls_t.co           
19 urls_expanded_url   
20 media_url           
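The two extraction steps can be sketched together. Both helper names below are hypothetical, and the code assumes the input was returned by an rtweet search function from the same rtweet release used in this post.

```r
# Hypothetical helper; assumes `tweets` was returned by an rtweet search function.
split_users_and_tweets <- function(tweets) {
  list(
    users  = rtweet::users_data(tweets),   # user-level columns
    tweets = rtweet::tweets_data(tweets)   # tweet-level columns
  )
}

# Hypothetical helper: column names shared by two data frames, as a tibble
shared_vars <- function(full, part) {
  tibble::as_tibble(base::intersect(names(full), names(part)))
}

# Example (needs the collected data):
# parts <- split_users_and_tweets(DubNtnStrngthNmbrs)
# UsersDubNtnStrngthNmbrs  <- parts$users
# TweetsDubNtnStrngthNmbrs <- parts$tweets
# shared_vars(DubNtnStrngthNmbrs, UsersDubNtnStrngthNmbrs)
```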

View the tweets in the text column

The tweets are stored in the column/variable called text. We can review the first 10 of these entries with dplyr::select() and utils::head().

# A tibble: 10 x 1
   text                                                                   
   <chr>                                                                  
 1 RT @warriors: #DubNation, your 2018 Champs 🏆 will arrive back home thi…
 2 RT @GloballyCurry30: Greatest. Of. All. Time. #DubNation #NBAFinals #N…
 3 RT @GloballyCurry30: #DubNation #WEBACK https://t.co/brNLP7JbRh        
 4 RT @indepocrlh: Espero q para o ano ao menos a final seja mais disputa…
 5 "RT @gbrandaoc11: Another one!💙💛\n#DubNation #NBAFinals https://t.co/J…
 6 RT @warriors: #DubNation, your 2018 Champs 🏆 will arrive back home thi…
 7 RT @warriors: #DubNation, your 2018 Champs 🏆 will arrive back home thi…
 8 RT @TripleH: Three @NBA Championships out of the last four years. Cong…
 9 "RT @NBA: Coach @SteveKerr and @QCook323 share a moment as NBA champs!…
10 RT @KiannaDy: Congrats GSW!!!! 💙💛💙💛 #DubNation https://t.co/Ij6xl5ZZlq…

The timeline of tweets with rtweet::ts_plot()

The rtweet package also comes with a handy function for plotting tweets over time with rtweet::ts_plot(). I added the ggthemes::theme_gdocs() theme and made the title text bold with ggplot2::theme(plot.title = ggplot2::element_text()).

This graph shows an increase in tweets for these hashtags between 12:00 and 18:00 on June 9.
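A plot along these lines could be built as follows. The helper name, the 15-minute interval, and the grouping by query are assumptions; only rtweet::ts_plot(), ggthemes::theme_gdocs(), and the bold title setting come from the text.

```r
# Hypothetical helper; assumes rtweet, dplyr, ggplot2, and ggthemes are installed
# and that `tweets` has created_at and query columns.
plot_tweet_timeline <- function(tweets, by = "15 minutes") {
  # Grouping first gives one line per search query
  rtweet::ts_plot(dplyr::group_by(tweets, query), by = by) +
    ggthemes::theme_gdocs() +
    ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
    ggplot2::labs(title = "Tweets over time by hashtag")
}

# Example (needs the collected data):
# plot_tweet_timeline(DubNtnStrngthNmbrs)
```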

Get longitude and latitude for tweets in DubTweets

I can also add some geographic information to the twitter data (i.e. the latitude and longitude for each tweet) using the rtweet::lat_lng() function.

This function adds a lat and lng variable to the DubNtnStrngthNmbrs data frame.

I verify this with names() and tail().

[1] "profile_image_url" "query"            
[1] "lat" "lng"

I will check how many of the tweets have latitude and longitude information using dplyr::distinct() and base::nrow().

[1] 198
[1] 198

Not every tweet has geographic information associated with it, so we will not be graphing all 56,966 observations. I’ll rename lng to long so it will be easier to join to the state-level data.
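The steps in this section can be sketched on toy data. The wrapper add_lat_lng() is hypothetical and only shows where rtweet::lat_lng() would be called; toy_geo is a made-up stand-in for its output, and dplyr and tibble are assumed to be installed.

```r
library(dplyr)

# Hypothetical wrapper: rtweet::lat_lng() appends lat and lng columns
add_lat_lng <- function(tweets) {
  rtweet::lat_lng(tweets)
}

# Toy stand-in for the output of add_lat_lng(); most tweets have no coordinates
toy_geo <- tibble::tibble(
  status_id = c("1", "2", "3", "4"),
  lat = c(37.792, NA, 37.792, 40.7),
  lng = c(-122.23, NA, -122.23, -74.0)
)

# The last two column names should be lat and lng
utils::tail(names(toy_geo), 2)

# Number of distinct non-missing latitude values
n_lat <- toy_geo %>%
  dplyr::filter(!is.na(lat)) %>%
  dplyr::distinct(lat) %>%
  base::nrow()

# Keep geotagged rows and rename lng to long to match the map data
toy_loc <- toy_geo %>%
  dplyr::filter(!is.na(lat), !is.na(lng)) %>%
  dplyr::rename(long = lng)
```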

Create World Map of #DubNation/#StrengthInNumbers

I will use the ggplot2::map_data() function to get the "world" data I’ll build a map with (save this as World).

Observations: 99,338
Variables: 6
$ long      <dbl> -69.90, -69.90, -69.94, -70.00, -70.07, -70.05, -70.04,...
$ lat       <dbl> 12.45, 12.42, 12.44, 12.50, 12.55, 12.60, 12.61, 12.57,...
$ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2...
$ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, ...
$ region    <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "...
$ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

The ggplot2::geom_polygon() function will create a map with the World data. The variables that build the map are long and lat (you can see why I renamed the lng variable to long in DubNtnStrngthNmbrsLoc). I added the Warriors team colors with fill and color.

Add the tweet data to the map

Now that I have a basic projection of the world, I can layer the twitter data onto the map with ggplot2::geom_point() by mapping long and lat to x and y. The data argument also needs to be specified because we are introducing a second data set (and will not be using the World data for this layer).

This is what’s referred to as the Mercator projection. It is the default setting in coord_quickmap(). I also add ggthemes::theme_map() for a cleaner print of the map (without ticks and axes).
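The base map and the tweet layer can be sketched as below. The three points in toy_tweets are made up, the object names ggWorld and ggTweetMap are placeholders, and the hex colors are assumed stand-ins for the team colors. The code assumes ggplot2, maps, and ggthemes are installed.

```r
# Country outlines; ggplot2::map_data() needs the maps package installed
World <- ggplot2::map_data("world")

# Toy tweet locations standing in for DubNtnStrngthNmbrsLoc
toy_tweets <- data.frame(
  long = c(-122.23, -43.2, 151.2),
  lat  = c(37.79, -22.9, -33.9)
)

# Base map: one polygon per group; fill and outline colors are assumed values
ggWorld <- ggplot2::ggplot() +
  ggplot2::geom_polygon(
    data = World,
    ggplot2::aes(x = long, y = lat, group = group),
    fill = "#FDB927", color = "#006BB6"
  )

# Layer the tweets on top, passing the second data set explicitly
ggTweetMap <- ggWorld +
  ggplot2::geom_point(
    data = toy_tweets,
    ggplot2::aes(x = long, y = lat),
    size = 1
  ) +
  ggplot2::coord_quickmap() +
  ggthemes::theme_map()
```

Because ggplot() is called without a default data set, each layer carries its own data argument, which is what lets the polygons and the points come from different data frames.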

The Mercator projection works well for navigation because the meridians (the grid lines that run north and south) are equally spaced, but the parallels (the lines that run east and west) are not. This distorts the land masses near both poles. The map above makes it look like Greenland is roughly 1/2 to 2/3 the size of Africa, when in reality Africa is about 14 times larger.
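The size comparison is easy to check with a little arithmetic. The land areas below are approximate figures I am supplying (about 30.37 million km² for Africa and 2.166 million km² for Greenland), and the sec² formula is the standard area inflation of a Mercator map relative to the equator.

```r
# Approximate land areas in square kilometres (rounded reference values)
africa_km2    <- 30.37e6
greenland_km2 <- 2.166e6

# How many times larger Africa is than Greenland
area_ratio <- africa_km2 / greenland_km2

# Areal inflation on a Mercator map relative to the equator: sec^2(latitude)
mercator_area_scale <- function(lat_deg) {
  1 / cos(lat_deg * pi / 180)^2
}

mercator_area_scale(0)   # no inflation at the equator
mercator_area_scale(72)  # central Greenland is inflated roughly tenfold
```

An inflation of about ten at Greenland's latitude, against almost none for equatorial Africa, is what makes the two look comparable on the map.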

Mapping with the Winkel tripel projection

An alternative to the Mercator projection is the Winkel tripel projection. This map attempts to correct the distortions in the Mercator map.

This map gets added via the ggalt::coord_proj() function, which takes a projection argument from the proj4 package. I add the Winkel tripel layer with ggalt::coord_proj("+proj=wintri") below.

This map is an OK start, but I want to add some additional customization:

  • I’ll start by adjusting the x axis manually with ggplot2::scale_x_continuous() (this gives a full ‘globe’ on the map),
  • I add the FiveThirtyEight theme from ggthemes::theme_fivethirtyeight(),
  • Remove the x and y axis labels with two ggplot2::theme() statements,
  • Finally, facet these maps by the query type (#DubNation or #StrengthInNumbers).
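Put together, the Winkel tripel map with these customizations might look like the sketch below. The helper name, the grey fill, and the x axis limits of -200 to 200 are assumptions. It also assumes ggplot2, ggalt (with proj4), and ggthemes are installed, and that ggalt::coord_proj() works with the installed ggplot2 version.

```r
# Hypothetical helper. `world` is the output of ggplot2::map_data("world");
# `tweets` needs long, lat, and query columns.
winkel_tweet_map <- function(world, tweets) {
  ggplot2::ggplot() +
    ggplot2::geom_polygon(
      data = world,
      ggplot2::aes(x = long, y = lat, group = group),
      fill = "grey80", color = "white"
    ) +
    ggplot2::geom_point(
      data = tweets,
      ggplot2::aes(x = long, y = lat),
      size = 1
    ) +
    # Winkel tripel projection
    ggalt::coord_proj("+proj=wintri") +
    # Assumed limits, wide enough to show the full globe
    ggplot2::scale_x_continuous(limits = c(-200, 200)) +
    ggthemes::theme_fivethirtyeight() +
    # Two theme() statements remove the x and y axis labels
    ggplot2::theme(axis.text.x = ggplot2::element_blank()) +
    ggplot2::theme(axis.text.y = ggplot2::element_blank()) +
    # One panel per search query
    ggplot2::facet_wrap(~ query)
}

# Example (needs the map data and the geotagged tweets):
# winkel_tweet_map(World, DubNtnStrngthNmbrsLoc)
```

The polygon layer has no query column, so ggplot2 repeats the world outline in every facet while the points are split by hashtag.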

To learn more about maps, check out this document put out by the U.S. Geological Survey on map projections. The description provided in the show The West Wing covers some of the distortions in the Mercator map, and this video from Vox does a great job illustrating the difficulties in rendering a sphere or globe on a 2-d surface.

Animate the timeline of tweets with gganimate

rtweet can collect twitter data over a period of 7-10 days, but the data I have in DubNtnStrngthNmbrsLoc only ranges from "2018-06-09 07:40:22 UTC" until "2018-06-10 02:36:31 UTC".

I want to see the spread of the #DubNation and #StrengthInNumbers tweets across the globe, and I want to use the point size in this animated map to indicate the number of followers associated with each tweet. gganimate is the ideal package for this because it works well with ggplot2.

I can start by looking at the number of followers for each account (followers_count) on the observations with location information (long and lat).

# A tibble: 10 x 2
   followers_count screen_name   
             <int> <chr>         
 1          424668 realfredrosser
 2          424666 realfredrosser
 3           30297 jgibbard      
 4           30297 jgibbard      
 5           18962 Lakeshore23   
 6           16800 coreydu       
 7           12333 billsowah1    
 8           10404 tigerbeat     
 9            8260 jeanquan      
10            8260 jeanquan      

It looks like there are a few screen_names with > 10000 followers. I can get a quick view of the distribution of this variable with qplot().

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This long tail tells me that these outliers are skewing the distribution. I want to see what the distribution looks like without these extremely high counts of followers.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This still looks skewed, but now we can see more of a distribution of followers. The majority of the observations fall under 2500 followers, with a few reaching above 7500. I will remove the observations with more than 10,000 followers.
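The ranking and the cut-off can be sketched with dplyr on a toy tibble. The object toy_loc and its four rows are made up; the 10,000-follower threshold is the one stated above.

```r
library(dplyr)

# Toy stand-in for the geotagged tweets
toy_loc <- tibble::tibble(
  screen_name = c("a", "b", "c", "d"),
  followers_count = c(424668L, 30297L, 8260L, 129L)
)

# Largest accounts first
top_accounts <- toy_loc %>%
  dplyr::arrange(dplyr::desc(followers_count)) %>%
  utils::head(10)

# Drop observations with more than 10,000 followers
toy_loc_small <- toy_loc %>%
  dplyr::filter(followers_count <= 10000)
```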

Observations: 469
Variables: 11
$ user_id          <chr> "115334561", "115334561", "115334561", "12583976...
$ status_id        <chr> "1005481675470471171", "1005481675470471171", "1...
$ screen_name      <chr> "jeanquan", "jeanquan", "jeanquan", "JackieKPIX"...
$ followers_count  <int> 8260, 8260, 8260, 7230, 7230, 7229, 7229, 7229, ...
$ friends_count    <int> 1829, 1829, 1829, 348, 348, 348, 348, 348, 1960,...
$ favourites_count <int> 366, 366, 366, 2358, 2358, 2358, 2358, 2358, 493...
$ created_at       <dttm> 2018-06-09 16:08:07, 2018-06-09 16:08:07, 2018-...
$ text             <chr> "#Sweep!  Pure #joy and #love for the #dubnation...
$ long             <dbl> -122.23, -122.23, -122.23, -122.23, -122.23, -12...
$ hashtags         <list> [<"Sweep", "joy", "love", "dubnation", "back2ba...
$ lat              <dbl> 37.792, 37.792, 37.792, 37.792, 37.792, 37.792, ...

Great! Now I will create another static Winkel tripel map before animating it, to get an idea of what it will look like. I start with the ggWorld2 base from above, then layer in the twitter data, this time specifying size = followers_count and ggplot2::scale_size_continuous(). The range sets the smallest and largest point sizes, and the breaks are the follower counts shown as cut-offs in the legend.
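Here is a small sketch of the size mapping on its own, without the map underneath. The data, the range of 1 to 8, and the break values are all assumptions chosen for illustration; ggplot2 is assumed to be installed.

```r
# Toy stand-in for the filtered, geotagged tweets
toy_loc <- data.frame(
  long = c(-122.23, -43.2, 151.2),
  lat  = c(37.79, -22.9, -33.9),
  followers_count = c(8260, 340, 129)
)

size_sketch <- ggplot2::ggplot() +
  ggplot2::geom_point(
    data = toy_loc,
    ggplot2::aes(x = long, y = lat, size = followers_count),
    alpha = 0.5
  ) +
  ggplot2::scale_size_continuous(
    range  = c(1, 8),             # smallest and largest point size
    breaks = c(100, 1000, 5000)   # follower counts shown in the legend
  )
```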

I also remove the x and y axis labels, and add the ggthemes::theme_hc() for a crisp looking finish.

I learned a helpful tip from Daniela Vasquez over at d4tagirl to build two data frames to use for displaying the animation before and after the points start appearing. These are best built using dates just outside the range of the created_at field.

# A tibble: 1 x 4
  created_at          followers_count  long   lat
  <dttm>                        <dbl> <dbl> <dbl>
1 2018-06-09 07:43:46               0     0     0
# A tibble: 61 x 4
   created_at          followers  long   lat
   <dttm>                  <dbl> <dbl> <dbl>
 1 2018-06-10 03:00:00         0     0     0
 2 2018-06-10 03:01:00         0     0     0
 3 2018-06-10 03:02:00         0     0     0
 4 2018-06-10 03:03:00         0     0     0
 5 2018-06-10 03:04:00         0     0     0
 6 2018-06-10 03:05:00         0     0     0
 7 2018-06-10 03:06:00         0     0     0
 8 2018-06-10 03:07:00         0     0     0
 9 2018-06-10 03:08:00         0     0     0
10 2018-06-10 03:09:00         0     0     0
# ... with 51 more rows
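The two helper data frames printed above could be built roughly like this. The object names ghost_points_ini and ghost_points_fin are assumptions; the timestamps, column names, and row counts are taken from the printed output. The sketch assumes tibble is installed.

```r
# A single invisible point (zero followers at 0, 0) at the start of the animation
ghost_points_ini <- tibble::tibble(
  created_at = as.POSIXct("2018-06-09 07:43:46", tz = "UTC"),
  followers_count = 0, long = 0, lat = 0
)

# One invisible point per minute for an hour after the last tweet
ghost_points_fin <- tibble::tibble(
  created_at = seq(
    as.POSIXct("2018-06-10 03:00:00", tz = "UTC"),
    as.POSIXct("2018-06-10 04:00:00", tz = "UTC"),
    by = "min"
  ),
  followers = 0, long = 0, lat = 0
)
```

The trailing frames give the animation a pause on the final state before it loops.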

Now I can use these two data frames to add additional layers to the animation. gganimate takes a frame argument, which is the variable the animation steps through (here created_at), so the points sized by followers_count appear over time.

The cumulative = TRUE tells R to leave the point on the map after it’s been plotted.
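The frame and cumulative settings described here belong to the gganimate interface available when this post was written. As a rough equivalent only, and not the code used for this post, the sketch below swaps in gganimate::transition_time() and gganimate::shadow_mark() from current gganimate releases. The helper name is hypothetical, and I have not verified how it behaves on top of ggalt::coord_proj().

```r
# Hypothetical helper; assumes ggplot2 and a current gganimate release.
# `base_map` is a ggplot world map; `tweets` needs long, lat,
# followers_count, and created_at columns.
animate_tweet_map <- function(base_map, tweets) {
  base_map +
    ggplot2::geom_point(
      data = tweets,
      ggplot2::aes(x = long, y = lat, size = followers_count),
      alpha = 0.5
    ) +
    gganimate::transition_time(created_at) +  # step through tweet timestamps
    gganimate::shadow_mark()                  # keep earlier points on the map
}

# Example (needs the map and the geotagged tweets):
# gganimate::animate(animate_tweet_map(ggWorld2, DubNtnStrngthNmbrsLoc))
```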

Now I have an animation that displays the tweets as they appeared in the two days following the NBA finals.

2018-06-11